Quantifying impairment and disease severity using AI models trained on healthy subjects

Automatic assessment of impairment and disease severity is a key challenge in data-driven medicine. We propose a framework to address this challenge, which leverages AI models trained exclusively on healthy individuals. The COnfidence-Based chaRacterization of Anomalies (COBRA) score exploits the decrease in confidence of these models when presented with impaired or diseased patients to quantify their deviation from the healthy population. We applied the COBRA score to address a key limitation of current clinical evaluation of upper-body impairment in stroke patients. The gold-standard Fugl-Meyer Assessment (FMA) requires in-person administration by a trained assessor for 30-45 minutes, which restricts monitoring frequency and prevents physicians from adapting rehabilitation protocols to the progress of each patient. The COBRA score, computed automatically in under one minute, is shown to be strongly correlated with the FMA on an independent test cohort for two different data modalities: wearable sensors (ρ = 0.814, 95% CI [0.700, 0.888]) and video (ρ = 0.736, 95% CI [0.584, 0.838]). To demonstrate the generalizability of the approach to other conditions, the COBRA score was also applied to quantify severity of knee osteoarthritis from magnetic-resonance imaging scans, again achieving significant correlation with an independent clinical assessment (ρ = 0.644, 95% CI [0.585, 0.696]).


Supplementary Tables
Supplementary Tables 1 and 2 provide a detailed description of the rehabilitation activities carried out by the subjects in the dataset used for quantification of stroke-induced impairment.
Supplementary Tables 3, 4, 5 and 6 report the accuracy and precision of the AI models for stroke functional-primitive prediction described in the Methods section.
Supplementary Table 7 reports the voxel-wise accuracy and precision of the AI model for segmentation of MRI scans described in the Methods section.

Supplementary Figures
Supplementary Figure 1 shows examples of the MRI scans used for quantification of knee-osteoarthritis severity. Supplementary Figures 2 and 3 show scatterplots of the FMA and COBRA scores for each rehabilitation activity.
In the main article (Results section, Figure 5), we show that object color is a confounding factor, which can spuriously reduce model confidence and therefore distort the COBRA score. To complement this observation, we analyzed the impact of varying video resolution on the COBRA score. We blurred half of the videos (chosen at random), reducing their resolution by a factor of 16 along each axis and then restoring them to their original dimensions. Supplementary Figure 4 shows the results of applying the COBRA score to a dataset containing the blurred and non-blurred videos. Blurring acts as a confounding factor, producing a spurious decrease in model confidence independent of impairment, which reduces the correlation between FMA and the COBRA score. This can be corrected by stratifying the videos, separating them according to whether they are blurred or not.

3 Supplementary Methods

Robustness of the COBRA Score to the Underlying AI Model
To evaluate the robustness of the proposed approach to the choice of underlying AI model, we performed experiments with alternative models for both of our applications of interest. In both cases, we found that the COBRA score based on the alternative models is still correlated with the gold-standard reference scores, indicating that the proposed approach is robust to the choice of the underlying AI model. For quantification of stroke-induced impairment, Supplementary Figure 5 shows a scatterplot of the gold-standard Fugl-Meyer assessment (FMA) score and the proposed COBRA score computed from wearable-sensor data using a sequence-to-sequence model based on recurrent neural networks (see below for a detailed description), which is completely different from the MS-TCN segmentation model used to obtain our main results. The correlation coefficient between the resulting COBRA score and the FMA score is again high: 0.774 (95% CI [0.636, 0.865]).
For quantification of knee-osteoarthritis severity, Supplementary Figure 6 shows a scatterplot and density plots of COBRA scores computed using a 3D U-Net, described in detail below, which is again different from the Multi-Planar U-Net model used to obtain our main results. The magnitude of the correlation coefficient between the resulting COBRA score and the gold-standard Kellgren-Lawrence grade is lower, but still statistically significant: -0.429 (95% CI [-0.503, -0.349]).
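As an illustration of how a correlation coefficient and its 95% CI of the kind reported above can be computed, the following is a minimal sketch. It assumes a Pearson correlation and a percentile-bootstrap interval; this section does not state which coefficient or interval procedure was actually used, so both are assumptions. The function name `pearson_with_bootstrap_ci` and the synthetic scores are hypothetical and are not data from the study.

```python
import numpy as np


def pearson_with_bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Return (rho, (lo, hi)): Pearson correlation and a percentile-bootstrap CI.

    Assumption: the pairs (x[i], y[i]) are independent subjects, so
    resampling subjects with replacement is a reasonable bootstrap scheme.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D arrays of equal length")

    rho = np.corrcoef(x, y)[0, 1]

    rng = np.random.default_rng(seed)
    n = x.size
    boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample subjects with replacement
        boot[b] = np.corrcoef(x[idx], y[idx])[0, 1]

    # Percentile interval; nanpercentile ignores degenerate resamples
    # (for example, a resample in which one variable is constant).
    lo, hi = np.nanpercentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(rho), (float(lo), float(hi))


# Synthetic stand-ins for a clinical score and a confidence-based score.
demo_rng = np.random.default_rng(1)
clinical = demo_rng.uniform(0, 66, size=50)
score = 0.01 * clinical + demo_rng.normal(0.0, 0.1, size=50)

rho, (lo, hi) = pearson_with_bootstrap_ci(clinical, score)
print(f"rho = {rho:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

A rank-based coefficient could be substituted by replacing `np.corrcoef` with a correlation of the ranks; the bootstrap loop would be unchanged.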

Stroke-Related Motor Impairment
As an alternative AI model to compute the COBRA score for quantification of stroke-induced impairment, we utilize the sequence-to-sequence model proposed in Kaku et al (2022); Parnandi et al (2022). The model consists of an encoder and a decoder, both implemented using recurrent neural networks. The encoder module is a three-layer bidirectional gated-recurrent-unit (GRU) network with a 1024-dimensional hidden representation, whereas the decoder is a one-layer bidirectional GRU with a 2048-dimensional hidden representation.
The model was trained on the healthy cohort by minimizing a label-smoothed cross-entropy loss (with a smoothing factor of 0.1) via stochastic gradient descent. We used the Adam optimizer with a learning rate of 5 · 10⁻⁴, and adjusted the learning rate with a 1cycle policy (Smith and Topin, 2019). Additional hyperparameters include a dropout rate of 0.1 and a weight decay of 0.0001 (selected via cross-validation).

Supplementary Figure 5: Robustness to the choice of AI model for quantification of stroke impairment. Scatterplot of the Fugl-Meyer assessment (FMA) score, based on in-person examination by an expert, and the proposed data-driven COBRA score computed from wearable-sensor data using a different AI model (described in Supplementary Method Section 3.1.1) from the one in main article Figure 3(a). The correlation between the COBRA and FMA scores is again high, indicating that the proposed approach is robust to the choice of underlying AI model.
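The label-smoothed cross-entropy loss used to train the sequence-to-sequence model can be sketched as follows. This is a minimal NumPy illustration, not the training code of the study. It assumes one common smoothing convention, in which the one-hot target is mixed with a uniform distribution over all classes; the text does not specify which convention was used. The function name `label_smoothed_cross_entropy` is hypothetical.

```python
import numpy as np


def label_smoothed_cross_entropy(logits, targets, smoothing=0.1):
    """Mean cross-entropy between softmax(logits) and smoothed one-hot targets.

    logits:  array of shape (n_samples, n_classes)
    targets: integer class indices of shape (n_samples,)
    Assumed convention: target = (1 - s) * one_hot + s / n_classes.
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=int)
    n, k = logits.shape

    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Smoothed targets: s / k on every class, plus (1 - s) on the true class.
    soft = np.full((n, k), smoothing / k)
    soft[np.arange(n), targets] += 1.0 - smoothing

    return float(-(soft * log_probs).sum(axis=1).mean())


# A very confident correct prediction is penalized once smoothing is applied.
confident = np.array([[10.0, 0.0, 0.0]])
print(label_smoothed_cross_entropy(confident, [0], smoothing=0.0))
print(label_smoothed_cross_entropy(confident, [0], smoothing=0.1))
```

Because the smoothed target never places all of its mass on a single class, the loss discourages saturated predictions, which matters for a score that is derived from model confidence.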

Knee-Osteoarthritis Severity
As an alternative AI model to compute the COBRA score for quantification of knee-osteoarthritis severity, we use a 3D U-Net (Çiçek et al, 2016). This model is a popular baseline for 3D volumetric segmentation tasks in medical applications (Perslev et al, 2019). The 3D U-Net has an encoder-decoder architecture with skip connections between corresponding layers of the encoder and decoder. Following Perslev et al (2019), we use three layers in the encoder and decoder. We train the 3D U-Net on the training cohort using the same training loss, optimizer, and early stopping rule as for the model described in the Methods section.

Distance-based Anomaly Quantification
In this section we present an alternative method for anomaly detection and quantification that utilizes an AI model trained only on healthy subjects. The method uses the Fréchet Inception Distance (FID) (Heusel et al, 2017) to quantify the deviation between a subject and a healthy population. FID is a metric designed to evaluate the similarity between two sets of feature representations extracted by a deep neural network. It has been applied to image generation (Heusel et al, 2017), where the goal is to determine whether generated images are close to real images or not.
We propose to leverage FID to compare a potentially impaired subject to a healthy reference population using the same model features as in the COBRA framework. First, the data associated with all individuals is fed into a deep neural network, trained to perform a task relevant to the impairment or disease of interest. Then, the features extracted by the neural network are compared via FID to determine to what extent the subject deviates from the population.
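The feature comparison described above can be sketched with the standard Fréchet distance between two Gaussians fitted to the feature sets, d² = ||μ_a − μ_b||² + Tr(Σ_a + Σ_b − 2(Σ_a Σ_b)^{1/2}). The code below is a minimal NumPy illustration under stated assumptions: the features are taken as given arrays (feature extraction by the neural network is not shown), the function name `frechet_distance` is hypothetical, and the synthetic features are not data from the study.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (n_samples, n_features), n_samples >= 2.
    The returned value is the quantity usually reported as FID.
    """
    a = np.asarray(feats_a, dtype=float)
    b = np.asarray(feats_b, dtype=float)

    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(b, rowvar=False))

    # Tr((cov_a cov_b)^{1/2}) is the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b. For positive semi-definite covariances
    # these eigenvalues are real and non-negative up to round-off, so the
    # real part is clipped at zero.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Synthetic stand-ins for network features of a healthy reference population
# and of two subjects, one resembling the population and one shifted away.
demo_rng = np.random.default_rng(0)
healthy = demo_rng.normal(size=(500, 8))
subject_similar = demo_rng.normal(size=(200, 8))
subject_shifted = demo_rng.normal(loc=1.0, size=(200, 8))

print(frechet_distance(healthy, subject_similar))
print(frechet_distance(healthy, subject_shifted))
```

Computing the trace term from eigenvalues avoids an explicit matrix square root; a larger distance indicates a larger deviation of the subject's features from those of the reference population.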
Correlation between video COBRA score and clinical assessment for individual rehabilitation activities. Scatterplots of the Fugl-Meyer assessment (FMA) score, based on in-person examination by an expert, and the proposed data-driven COBRA score computed from video data for individual rehabilitation activities. The correlation coefficient ρ is highest for simpler, more structured activities such as Glasses, Shelf and Table-top.

Supplementary Table 2: Description of the activities performed by the subjects in the dataset used for quantification of stroke-induced impairment (2/2).

Supplementary Table 4: Performance of the AI model used to compute the COBRA score for quantification of stroke-induced impairment from wearable-sensor data on subjects with different levels of impairment. Performance degrades as the impairment level increases. The metrics are defined as in Supplementary Table 3. 95% CIs are shown in brackets. Based on Fugl-Meyer Assessment: 0-25 is severe, 26-52 is moderate, 53-65 is mild, and 66 is healthy.

Supplementary Table 5: Performance of the AI model used to compute the COBRA score for quantification of stroke-induced impairment from video data on held-out healthy subjects. The metrics are defined as in Supplementary Table 3. 95% CIs are shown in brackets.

Supplementary Table 6: Performance of the AI model used to compute the COBRA score for quantification of stroke-induced impairment from video data on subjects with different levels of impairment. Performance degrades as the impairment level increases. The metrics are defined as in Supplementary Table 3. 95% CIs are shown in brackets. Based on Fugl-Meyer Assessment: 0-25 is severe, 26-52 is moderate, 53-65 is mild, and 66 is healthy.

Supplementary Table 7: Voxel-wise performance of the AI models used to compute the COBRA score for quantification of knee-osteoarthritis severity from MRI scans on held-out healthy subjects. 95% CIs are shown in brackets.

Supplementary Figure 1: Knee MRI images for subjects with different Kellgren-Lawrence (KL) grades (top) and corresponding segmentation annotations (bottom) indicating the tissue in each voxel.
Correlation between wearable-sensor COBRA score and clinical assessment for individual rehabilitation activities. Scatterplots of the Fugl-Meyer assessment (FMA) score, based on in-person examination by an expert, and the proposed data-driven COBRA score computed from wearable-sensor data for individual rehabilitation activities. The correlation coefficient ρ is highest for simpler, more structured activities such as Glasses, Shelf and Table-top.